orated model of functional preferences on Cf
elements which constrains the set of possi- 
ble antecedents according to topic/comment 
patterns. 

8 Conclusion 

In this paper, we have outlined a model of 
text ellipsis parsing. It considers conceptual 
criteria to be of primary importance and pro- 
vides a proximity measure in order to assess 
various possible antecedents for considera- 
tion of proper bridges (Clark, 1975) to ellipti- 
cal expressions. In addition, functional con- 
straints based on topic/comment patterns 
contribute further restrictions on proper el- 
liptical antecedents. The particular advan- 
tage of our approach lies in the integrated 
treatment of text-level ellipsis within a sin- 
gle coherent grammar format. 

The anaphora resolution module (Strube 
and Hahn, 1995) and the ellipsis handler have 
both been implemented in Smalltalk as part 
of a comprehensive text parser for German. 
Besides the information technology domain,
experiments with our parser have also been
successfully run on medical domain texts,
thus indicating that the heuristics we have
been developing are not bound to a particular
domain. The current lexicon contains a
hierarchy of approximately 70 word class
specifications with nearly 2,500 lexical
entries and corresponding concept descriptions
available from the LOOM knowledge
representation system (MacGregor and Bates,
1987): 650 and 400 concept/role specifications
for the information technology and medicine
domains, respectively.
stances, viz. PCI-Motherboard, Compaq,
and LTE-Lite-25, which form the current
forward-looking centers. The terminological
reasoner attempts to determine a role
chain between one of these instances and
Cpu. Only PCI-Motherboard passes this
test successfully. Both concepts can be linked
via the has-cpu role at unit length (depth) 1.

Consider a slight variation of text fragment
(1): If PCI-Motherboard in (1) is replaced
by LCD-Display (as in (2)), the result of the
ellipsis resolution differs from the previous
example. Since, due to conceptual constraints,
LCD-Display cannot be considered a proper
antecedent of Cpu, LTE-Lite-25 is selected
as the only valid antecedent. The corresponding
conceptual link (see the KB listing in Table 7)
consists of the relation composition
LTE-Lite-25 - has-central-unit - Central-Unit -
has-motherboard - Motherboard - has-cpu - Cpu
(having depth 3).

(2) Compaq bestückt den LTE-Lite/25 mit einem
LCD-Display. Die CPU hat eine Taktfrequenz
von 50 MHz.

(Compaq equips the LTE-Lite/25 with an LCD
display. The CPU comes with a clock frequency
of 50 MHz.)



Table 7: Transcript from the Domain Knowl- 
edge Base for Text Fragment (2) 

7 Comparison with Related 
Approaches 

As far as text-level processing is concerned,
the framework of DRT (Kamp and Reyle,
1993), at first sight, constitutes a particularly
strong alternative to our approach. The
complex machinery of DRT, however, might
work well for anaphora, but would fail if
non-anaphoric, e.g., elliptical text phenomena
had to be interpreted (but see Wada
(1994) for a recent attempt to deal with
restricted forms of ellipsis in the DRT context).
This shortcoming is simply due to the fact
that DRT is basically a semantic theory, not
a full-fledged model for text understanding.
In particular, it lacks any systematic connection
to well-developed reasoning systems
accounting for conceptual knowledge and
specific problem-solving models which underlie
the chosen domain.

As far as proposals for the analysis of tex- 
tual ellipsis are concerned, none of the stan- 
dard grammar theories (e.g., HPSG, LFG, 
GB, CG, TAG) covers this issue. This is not 
surprising at all, as their advocates pay al- 
most no attention to the text level of linguis- 
tic description (with the exception of several 
forms of anaphora) and also do not take con- 
ceptual criteria as part of grammatical de- 
scriptions seriously into account. Conven- 
tional grammar-based approaches seem en- 
tirely infeasible unless one were willing to du- 
plicate the knowledge which has been gath- 
ered in a DKB at the grammar level in terms 
of, say, a highly diversified "case" grammar 
system (e.g., as advocated by Kehler (1993) 
and Grover et al. (1994)). Unfortunately,
this leaves us with the same problems (as
those already discussed in Section 3) under a
different cover, since relations among different
(sub)types of cases would then have to be
scrutinized.

Actually, only a few systems exist which deal
with textual ellipsis. Those which take care
of it often do so in an ad hoc way. As an
example, consider the PUNDIT system (Palmer
et al., 1986) which provides a rough
implementation solution for a particular domain.
This work shares a lot of similarities with our
approach (e.g., the use of focus mechanisms
(Grosz and Sidner, 1986)). But we consider
our proposal superior, since it provides
a more general, implementation-independent
treatment at the grammar level. The approach
reported in this paper also extends
our own previous work on textual ellipses
(Hahn, 1989) by the incorporation of an elab-



(preferred-cb (LCD-DISPLAY-0008 COMPAQ
               LTE-LITE-25) CPU-0009)

LCD-DISPLAY-0008 - depth: 1 NIL
COMPAQ           - depth: 1 NIL
LTE-LITE-25      - depth: 1 NIL

LCD-DISPLAY-0008 - depth: 2 NIL
COMPAQ           - depth: 2 NIL
LTE-LITE-25      - depth: 2 NIL

LCD-DISPLAY-0008 - depth: 3 NIL
COMPAQ           - depth: 3 NIL
LTE-LITE-25      - depth: 3 TRUE

==> (LTE-LITE-25 (COMPOSE HAS-CENTRAL-UNIT
     HAS-MOTHERBOARD HAS-CPU) CPU-0009)
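
The depth-by-depth probing visible in this transcript can be mimicked by a small iterative-deepening search. The following Python sketch is only an illustration under stated assumptions: the dictionaries `ISA` and `ROLES` are a hand-made excerpt of the knowledge base in Fig. 1, the function names are hypothetical, and the candidates are assumed to arrive in their Cf order. It is not the Smalltalk/LOOM implementation described in the paper.

```python
# Hypothetical excerpt of the domain knowledge base (names follow Fig. 1).
ISA = {
    "LTE-Lite-25": "Notebook",
    "Notebook": "Computer-System",
    "PCI-Motherboard": "Motherboard",
}
ROLES = {  # concept -> list of (role name, filler concept)
    "Computer-System": [("has-central-unit", "Central-Unit")],
    "Central-Unit": [("has-motherboard", "Motherboard")],
    "Motherboard": [("has-cpu", "Cpu")],
}


def roles_of(concept):
    """Roles defined at the concept or inherited from its superconcepts."""
    found = []
    while concept is not None:
        found.extend(ROLES.get(concept, []))
        concept = ISA.get(concept)
    return found


def chain_of_length(concept, target, depth):
    """Return the role names of a chain of exactly `depth` roles, or None."""
    if depth == 0:
        return [] if concept == target else None
    for role, filler in roles_of(concept):
        rest = chain_of_length(filler, target, depth - 1)
        if rest is not None:
            return [role] + rest
    return None


def preferred_cb(candidates, target, max_depth=5):
    """Probe all candidates at depth 1, then 2, ... and stop at the first hit.

    max_depth=5 mirrors the observation that no valid chain exceeded length 5.
    """
    for depth in range(1, max_depth + 1):
        for candidate in candidates:
            chain = chain_of_length(candidate, target, depth)
            print(f"{candidate} - depth: {depth} {chain is not None}")
            if chain is not None:
                return candidate, chain
    return None


result = preferred_cb(["LCD-Display", "Compaq", "LTE-Lite-25"], "Cpu")
```

Because shallower depths are exhausted first, the first hit is also a hit of minimal path length; ties at the same depth are resolved here simply by the order of the candidate list.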



[Figure 3: dependency parse of text fragment (1); the legend marks the searchAntecedent message.]

Compaq bestückt den LTE-Lite/25 mit einem PCI-Motherboard. Die CPU hat eine Taktfrequenz von 50 MHz.
Compaq equips the LTE-Lite/25 with a PCI-motherboard. The CPU comes with a clock frequency of 50 MHz.

Figure 3: Sample Parse for Text Ellipsis Resolution

1. In phase 1, the message is forwarded
from its initiator to the sentence delimiter
of the preceding sentence, where its
state is set to phase 2.

2. In phase 2 the sentence delimiter's
acquaintance Cf is tested for the predicate
PreferredConceptualBridge.

Note that only nouns and pronouns are
capable of responding to the
SearchTextEllipsisAntecedent message and of
being tested as to whether they fulfill the
required criteria for an elliptical relation.
If the ellipsis predicate
PreferredConceptualBridge succeeds, the
determined antecedent sends a
TextEllipsisAntecedentFound message to the
initiator of the SearchTextEllipsisAntecedent
message. Upon receipt of the AntecedentFound
message, the discourse referent of the
elliptical expression is conceptually related to
the antecedent's concept via a has-part-type
relation, thus preserving cohesion at the
conceptual level of text propositions. We will
now illustrate (cf. Fig. 3) the protocol for
establishing elliptical relations, referring to the
already introduced text fragment (1) which
is repeated at the bottom line of Fig. 3.

The second sentence of (1) contains the
definite noun phrase die CPU. Since CPU
does not subsume any word at the conceptual
level in the preceding text (cf. Fig. 1),
the anaphora test fails; the definite noun
phrase die CPU has also not been integrated
in terms of a partonomic relation into
the conceptual representation of the sentence
as a result of the semantic interpretation.
As a consequence, a SearchTextEllipsisAntecedent
message is created by the word actor
for CPU. That message is sent directly
to the sentence delimiter of the previous
sentence, where the predicate
PreferredConceptualBridge is evaluated for the
acquaintance Cf. As one of the relevant heads
can be tested successfully (the corresponding
concept PCI-Motherboard is related
to Cpu via the role has-cpu), a
TextEllipsisAntecedentFound message is sent to
the initiator of the SearchAntecedent message. An
appropriate update links the corresponding
concepts via the role has-cpu and, thus,
cohesion is established at the conceptual level
of the text knowledge base.
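
The two-phase protocol can be made concrete with a small simulation. The Python sketch below is a deliberately simplified, synchronous rendering: real word actors communicate asynchronously, and all class, method and parameter names here (`WordActor`, `SentenceDelimiter`, `find_bridge`, `toy_bridge`) are illustrative assumptions rather than the ParseTalk implementation. The callback `find_bridge` merely stands in for the evaluation of PreferredConceptualBridge.

```python
from dataclasses import dataclass, field


@dataclass
class WordActor:
    concept: str
    links: list = field(default_factory=list)  # conceptual relations set up so far

    def text_ellipsis_antecedent_found(self, antecedent, role_chain):
        # Relate the referent of the elliptical expression to the antecedent.
        self.links.append((antecedent.concept, role_chain, self.concept))


@dataclass
class SentenceDelimiter:
    cf: list  # acquaintance: forward-looking centers of the closed sentence

    def receive(self, message, find_bridge):
        # End of phase 1: the message has reached the preceding delimiter.
        message["phase"] = 2
        # Phase 2: test the acquaintance Cf for a preferred conceptual bridge.
        result = find_bridge(self.cf, message["initiator"])
        if result is not None:
            antecedent, role_chain = result
            message["initiator"].text_ellipsis_antecedent_found(antecedent, role_chain)


def search_text_ellipsis_antecedent(initiator, previous_delimiter, find_bridge):
    """Create the search message (phase 1) and forward it to the delimiter."""
    message = {"initiator": initiator, "phase": 1}
    previous_delimiter.receive(message, find_bridge)
    return message


def toy_bridge(cf, initiator):
    """Stand-in predicate: only PCI-Motherboard bridges to Cpu (via has-cpu)."""
    for center in cf:
        if center.concept == "PCI-Motherboard" and initiator.concept == "Cpu":
            return center, ["has-cpu"]
    return None


cpu = WordActor("Cpu")
delimiter = SentenceDelimiter(
    [WordActor("PCI-Motherboard"), WordActor("Compaq"), WordActor("LTE-Lite-25")]
)
message = search_text_ellipsis_antecedent(cpu, delimiter, toy_bridge)
```

After the call, `cpu.links` holds the relation between PCI-Motherboard and Cpu, which corresponds to the update of the text knowledge base described above.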

In order to illustrate our approach under 
slightly varying conditions, let us discuss ex- 
amples of ellipsis resolution with focus on the 
DKB fragment depicted in Fig. 1. We ab- 
stract from the event-oriented description of 
the parsing process itself and concentrate in- 
stead on the computations being performed 
within the knowledge representation system. 
Consider text fragment (1) again. The relevant
knowledge base (KB) operations (see
the listing in Table 6) are caused by the
evaluation of the predicate
PreferredConceptualBridge and are performed on three in-

(preferred-cb (PCI-MOTHERBOARD-0004
               COMPAQ LTE-LITE-25) CPU-0005)

PCI-MOTHERBOARD-0004 - depth: 1 TRUE

==> (PCI-MOTHERBOARD-0004 HAS-CPU CPU-0005)



Table 6: Transcript from the Domain Knowl- 
edge Base for Text Fragment (1) 



More specifically, there must be a connected 
path linking the two concepts under consid- 
eration via a chain of conceptual roles. 

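\[
\mathit{ProximityScore}(\mathit{from\text{-}concept}, \mathit{to\text{-}concept}) :=
\begin{cases}
\min n & \text{if } \exists\, x_0, \ldots, x_n \in \mathcal{F} :
         \exists\, r_0, \ldots, r_{n-1} \in \mathcal{R} : \\
       & \quad x_0 = \mathit{from\text{-}concept} \;\wedge\; x_n = \mathit{to\text{-}concept} \\
       & \quad \wedge\; \forall i \in [0, n-1] : (x_i, r_i, x_{i+1}) \in \mathit{permit} \\
\infty & \text{else}
\end{cases}
\]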



Table 4: Determining the Conceptual Dis- 
tance between Two Concepts 
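
As an illustration of Table 4, the score can be computed by a breadth-first search over role triples. This is a minimal Python sketch under assumptions of our own: `ISA` and `PERMIT` are a hand-made excerpt of Fig. 1, the function names are hypothetical, and inherited roles are handled by looking up the superconcepts of each visited concept, so that generalization links do not add to the path length.

```python
from collections import deque
from math import inf

# Hypothetical excerpt of the domain knowledge base (names follow Fig. 1).
ISA = {
    "LTE-Lite-25": "Notebook",
    "Notebook": "Computer-System",
    "PCI-Motherboard": "Motherboard",
}
PERMIT = {  # (domain concept, role name, filler concept)
    ("Computer-System", "has-central-unit", "Central-Unit"),
    ("Central-Unit", "has-motherboard", "Motherboard"),
    ("Motherboard", "has-cpu", "Cpu"),
}


def ancestors(concept):
    """Yield the concept itself and all of its superconcepts."""
    while concept is not None:
        yield concept
        concept = ISA.get(concept)


def proximity_score(from_concept, to_concept):
    """Number of roles on the shortest role chain, or inf if none exists."""
    queue = deque([(from_concept, 0)])
    seen = {from_concept}
    while queue:
        concept, depth = queue.popleft()
        if concept == to_concept:
            return depth
        supers = set(ancestors(concept))  # roles are inherited along isa
        for domain, _role, filler in PERMIT:
            if domain in supers and filler not in seen:
                seen.add(filler)
                queue.append((filler, depth + 1))
    return inf
```

With this excerpt, PCI-Motherboard reaches Cpu through a single role, LTE-Lite-25 needs the composition of three roles, and Compaq has no role chain to Cpu at all.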

Finally, the predicate PreferredConceptualBridge
(Table 5) combines both criteria. A
lexical item x is determined as the proper
antecedent of the elliptical expression y if it is a
potential antecedent and if there exists no
alternative antecedent z whose ProximityScore
either is below that of x or, if their
ProximityScores are identical, whose strength of
preference under the TC relation is higher than
that of x:



Table 5: Determining the Preferred Concep- 
tual Bridge for an Elliptical Expression 
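
The selection logic of Table 5 can be sketched in a few lines of Python. The sketch assumes that the potential antecedents are passed in decreasing TC preference (so that list position encodes the TC relation) and that their ProximityScores have already been computed; both the function name and this input format are illustrative. Treating an infinite score as "no bridge available" is an added assumption that the formal predicate does not state explicitly.

```python
from math import inf


def preferred_conceptual_bridge(cf, score):
    """Return the preferred antecedent(s) among the potential antecedents.

    cf    -- potential antecedents, ordered by decreasing TC preference
    score -- mapping from each item in cf to its ProximityScore
    """
    def beaten(x):
        # x loses if some z is strictly closer, or equally close and TC-preferred.
        return any(
            score[z] < score[x]
            or (score[z] == score[x] and cf.index(z) < cf.index(x))
            for z in cf
            if z != x
        )

    return [x for x in cf if score[x] != inf and not beaten(x)]


cf = ["PCI-Motherboard", "Compaq", "LTE-Lite-25"]
score = {"PCI-Motherboard": 1, "Compaq": inf, "LTE-Lite-25": 3}
winner = preferred_conceptual_bridge(cf, score)
```

For distinct items the result contains at most one element, since any two candidates are ordered either by score or, on a tie, by their position on the Cf.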

The mechanism we provide for the resolu- 
tion of text-level ellipses is strongly rooted 
in the structural properties of KL-ONE-type 
terminological knowledge representation lan- 
guages (MacGregor, 1991). Its focus is on ag- 
gregation or mereological relations (the most 
general form being part-of, but more refined 
conceptual roles usually must be supplied)^. 
Our metrical criterion favors the most prox- 

^The need for complex semantic data structures 
for proper ellipsis resolution has also been recognized 
by other computational linguists, e.g., by Kehler 
(1993), whose discourse copying algorithm uses a 
Davidsonian-style event representation which is close 
to the notion of frames. However, his use of event 
structures, or similarly, the event types of the unifi- 
cation discourse grammar framework to which Grover 
et al. (1994) refer are rather paraphrases of com- 
mon case role labels than sophisticated conceptual 
attributes (both papers aim at VP ellipsis!), and thus 
lack the level of conceptual differentiation needed to 
adequately cope with textual ellipsis. 



imate non-generalization-based link between 
the concept denoted by some already avail- 
able discourse entity (the antecedent) and 
the concept denoted by the currently con- 
sidered lexical item (the elliptical item). In 
case the distances are of equal length func- 
tional considerations complement the con- 
ceptual ones on the basis of the information 
structure of the preceding sentence. 

6 Text Cohesion Parsing: 
Ellipsis Resolution 

The actor computation model (Agha and 
Hewitt, 1987) provides the background for 
the procedural interpretation of lexicalized 
grammar specifications, as those given in the 
previous section, in terms of so-called word 
actors (Schacht et al., 1994). Word actors 
combine object-oriented features with con- 
currency and thus provide a methodolog- 
ically clean platform for inherently paral- 
lel, lexically distributed computations. The 
model assumes word actors to communicate 
via asynchronous message passing. An ac- 
tor can only send messages to other actors it 
knows about, its so-called acquaintances. 

The resolution of textual ellipsis depends 
on the results of the resolution of nominal 
anaphora as well as on the termination of 
the semantic interpretation of the sentence. 
It will only be triggered at the occurrence of 
phrase P 

• when P is non-anaphoric, and

• when P is not connected at the conceptual
level (via a has-part-type relation)
to some referent denoted in the current
sentence.

The protocol level of text analysis encompasses
the procedural interpretation of the
grammatical constraints from Section 5. A
SearchTextEllipsisAntecedent message is
triggered if a SearchNomAntecedent message
(intended to account for the resolution of
nominal anaphora) quits without success and the
semantic interpretation did not produce a
proper (partonomic) relation at the conceptual
level of representation for the elliptical
phrase. The protocol for establishing cohesive
links based on the recognition of textual
ellipsis consists of two phases:



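\[
\begin{array}{l}
\mathit{PreferredConceptualBridge}(x, y, n) :\Leftrightarrow \\
\quad \mathit{isPotentialEllipticAntecedent}(x, y, n) \\
\quad \wedge\; \neg\exists z : \mathit{isPotentialEllipticAntecedent}(z, y, n) \\
\qquad \wedge\; (\, \mathit{ProximityScore}(z.\mathit{concept}, y.\mathit{concept})
        < \mathit{ProximityScore}(x.\mathit{concept}, y.\mathit{concept}) \\
\qquad\quad \vee\; (\, \mathit{ProximityScore}(z.\mathit{concept}, y.\mathit{concept})
        = \mathit{ProximityScore}(x.\mathit{concept}, y.\mathit{concept}) \\
\qquad\qquad \wedge\; z >_{TC} x \,)\,)
\end{array}
\]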



man have shown that there are no anaphora
whose antecedents occur as modifiers except
for those at the right edge of the rheme;
therefore they are included in Cf (cf. Fig. 2).

The middle and the bottom level of Table 1
are constituted by $>_{TC_{anatype}}$ and
$>_{TC_{anafunc}}$, which denote preference
relations dealing exclusively with multiple
occurrences of (resolved) anaphora in the
preceding sentence. $>_{TC_{anatype}}$
distinguishes among different forms of anaphora
(viz., pronominal, possessive, nominal and
ellipsis form) and their associated preference
order, while $>_{TC_{anafunc}}$ covers the
functionally based preference order for multiple
occurrences of the same type of anaphora (i.e.,
whether they occur as subject, direct object,
indirect object or adjunct).

Given these basic relations, we may now
formulate the composite relation $>_{TC}$
(Table 2). It states the conditions for the
comprehensive ordering of items on Cf (x and y
denote lexical heads).



$>_{TC}$ := { (x, y) |
   if x and y both represent the same type of anaphora
      then the relation $>_{TC_{anafunc}}$ applies to x and y
   else if x and y both represent different forms of anaphora
      then the relation $>_{TC_{anatype}}$ applies to x and y
   else the relation $>_{TC_{base}}$ applies to x and y }

Table 2: Global Topic/Comment Relation
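
Read procedurally, the composite relation amounts to a three-way case distinction. The following Python sketch illustrates it under explicit assumptions: the `Head` record and its field names are invented for the example, the anaphora types are the four forms named in the running text, and every resolved anaphor is assumed to carry a grammatical function.

```python
from dataclasses import dataclass
from typing import Optional

# Preference orders, most preferred first.
BASE = ["anaphora", "head rheme", "right edge rheme", "head theme", "head unmarked"]
ANATYPE = ["pronominal", "possessive", "nominal", "ellipsis"]
ANAFUNC = ["subject", "direct object", "indirect object", "adjunct"]


@dataclass(frozen=True)
class Head:
    word: str
    position: str                   # topic/comment configuration, one of BASE[1:]
    anatype: Optional[str] = None   # one of ANATYPE if the head is a resolved anaphor
    function: Optional[str] = None  # one of ANAFUNC, required for anaphors


def base_rank(head):
    """Anaphoric heads outrank every topic/comment configuration."""
    return 0 if head.anatype else BASE.index(head.position)


def tc_precedes(x, y):
    """True if x is strictly preferred over y under the composite relation."""
    if x.anatype and y.anatype:
        if x.anatype == y.anatype:  # same type of anaphora: compare functions
            return ANAFUNC.index(x.function) < ANAFUNC.index(y.function)
        return ANATYPE.index(x.anatype) < ANATYPE.index(y.anatype)
    return base_rank(x) < base_rank(y)


pron = Head("er", "head theme", "pronominal", "subject")
nom = Head("der Rechner", "head rheme", "nominal", "direct object")
rheme = Head("Motherboard", "head rheme")
theme = Head("Compaq", "head theme")
```

Here `tc_precedes(pron, nom)` holds because the two anaphors differ in type, while `tc_precedes(rheme, theme)` falls through to the base ranking.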

5 Grammatical Predicates for 
Textual Ellipsis 

We here build on the ParseTalk model of
dependency grammar, a fully lexicalized grammar
theory which employs default inheritance
for lexical hierarchies (Bröker et al.,
1994; Hahn et al., 1994). The grammar
formalism is based on dependency relations
between lexical heads and modifiers at the
sentence level. The dependency specifications
allow a tight integration (not a mixture!) of
linguistic knowledge (grammar) and conceptual
knowledge (domain model), thus making
powerful terminological reasoning facilities
directly available for the parsing process
(cf. also Hajičová (1987) in support of this
view). Accordingly, syntactic analysis and
semantic interpretation are closely coupled.

This exposition of the ParseTalk grammar
framework is tailored to the requirements
of the resolution of textual ellipses.
We assume the following conventions to hold:
$\mathcal{C}$ = {Word, Nominal, Noun,
DetDefinite, ...} denotes the set of word
classes, and $isa_{\mathcal{C}}$ = {(Nominal, Word),
(Noun, Nominal), (DetDefinite, Nominal), ...}
$\subset \mathcal{C} \times \mathcal{C}$ denotes
the subclass relation which yields a hierarchical
ordering among these classes. The concept
hierarchy consists of a set of concept names
$\mathcal{F}$ = {Computer-System, Notebook,
Central-Unit, ...} (cf. Fig. 1) and a subclass
relation $isa_{\mathcal{F}}$ = {(Notebook,
Computer-System), (PCI-Motherboard,
Motherboard), ...} $\subset \mathcal{F} \times \mathcal{F}$.
The set of role names $\mathcal{R}$ = {has-part,
has-cpu, ...} contains the labels of admitted
conceptual roles. The relation permit
$\subset \mathcal{F} \times \mathcal{R} \times \mathcal{F}$
characterizes the range of possible conceptual
roles among concepts, e.g., (Motherboard,
has-cpu, Cpu) $\in$ permit. Furthermore,
object.attribute denotes the value of the
property attribute at object, while head denotes
a structural relation within dependency trees,
viz. x being the head of y.

The resolution of textual ellipsis is based
on two major criteria, a structural and a
conceptual one. The structural condition is
embodied in the predicate
isPotentialEllipticAntecedent (Table 3). An
elliptical relation between two lexical items is
here restricted to pairs of nouns. The elliptical
phrase which occurs in the n-th sentence is
restricted to a definite NP and the antecedent
must be one of the forward-looking centers of
the preceding sentence.



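\[
\begin{array}{l}
\mathit{isPotentialEllipticAntecedent}(x, y, n) :\Leftrightarrow \\
\quad x \;\mathit{isa}_{\mathcal{C}}^{*}\; \mathit{Noun} \\
\quad \wedge\; y \;\mathit{isa}_{\mathcal{C}}^{*}\; \mathit{Noun} \\
\quad \wedge\; \exists z : (y \;\mathit{head}\; z \;\wedge\; z \;\mathit{isa}_{\mathcal{C}}^{*}\; \mathit{DetDefinite}) \\
\quad \wedge\; x \in C_f(U_{n-1})
\end{array}
\]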

Table 3: Determining a Potential Elliptical 
Antecedent 
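
The structural test of Table 3 translates almost literally into code. In the Python sketch below, the `Word` record, the `ISA_C` excerpt of the word class hierarchy and the function names are assumptions made for the example. Instead of the sentence index n, the function receives the forward-looking centers of the preceding sentence directly.

```python
from dataclasses import dataclass, field

# Hypothetical excerpt of the word class hierarchy (subclass -> superclass).
ISA_C = {"Nominal": "Word", "Noun": "Nominal", "DetDefinite": "Nominal"}


def isa_star(word_class, target):
    """Reflexive-transitive closure of the subclass relation on word classes."""
    while word_class is not None:
        if word_class == target:
            return True
        word_class = ISA_C.get(word_class)
    return False


@dataclass
class Word:
    form: str
    word_class: str
    modifiers: list = field(default_factory=list)  # words this word is head of


def is_potential_elliptic_antecedent(x, y, cf_previous):
    """x and y are nouns, y governs a definite determiner, x is on the previous Cf."""
    return (
        isa_star(x.word_class, "Noun")
        and isa_star(y.word_class, "Noun")
        and any(isa_star(z.word_class, "DetDefinite") for z in y.modifiers)
        and x in cf_previous
    )


die = Word("die", "DetDefinite")
cpu = Word("CPU", "Noun", [die])
motherboard = Word("PCI-Motherboard", "Noun")
cf_previous = [motherboard, Word("Compaq", "Noun"), Word("LTE-Lite/25", "Noun")]
ok = is_potential_elliptic_antecedent(motherboard, cpu, cf_previous)
```

A noun without a definite determiner, or a candidate that is not among the centers of the preceding sentence, fails the test.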

The function ProximityScore (Table 4)
captures the basic conceptual condition in
terms of the distance between two concepts.



$>_{TC_{base}}$:     anaphora > head rheme > right edge rheme > head theme > head unmarked

$>_{TC_{anatype}}$:  pronominal anaphor > possessive pronoun > nominal anaphor > textual ellipsis

$>_{TC_{anafunc}}$:  subject > direct object > indirect object > adjunct

Table 1: Functional Ranking on the Cf for German according to Topic/Comment Patterns



sal and limiting the number of nodes being 
passed — these are the major constraints for 
the creation (and interpretation) of such el- 
liptical relations by the text producer (un- 
derstander). They must be met to preserve 
textual cohesion. 

4 Centering Principles for German 

Conceptual criteria are of tremendous importance,
but they are not sufficient for proper
ellipsis resolution. Additional criteria have to
be supplied in the case of equal role length for
several alternative antecedents. We therefore
incorporate into our model various functional
criteria in terms of topic/comment patterns
which originate from the (dependency) structure
of the preceding sentence. The organizational
framework for this type of information
is provided by the well-known centering
mechanism (Grosz et al., 1995). Accordingly,
we distinguish each utterance's
backward-looking center ($C_b(U_n)$) and its
forward-looking centers ($C_f(U_n)$). The ranking
imposed on the elements of the Cf reflects
the assumption that the first element
of $C_f(U_n)$ is the most preferred antecedent
of an anaphoric (or elliptical) expression in
$U_{n+1}$, while the remaining elements are
ordered according to decreasing preference for
establishing referential links.

The main difference between Grosz et al.'s 
work and our proposal concerns the crite- 
ria for ranking the forward-looking centers. 
While Grosz et al. (1995) assume (for the 
English language) that grammatical roles are 
a major determinant for the ranking on the 
Cf, we claim that the major determinant 
for German - a language with relatively 
free word order - is the functional informa- 
tion structure of the sentence in terms of 
topic/comment patterns. Accordingly, the 
topic denotes the given information which oc- 
curs at the beginning of the sentence, while 
the comment denotes the new information 



which is given at the end of the sentence. 
All phrases intervening between topic and comment
are called unmarked, because they are irrelevant
for the information structure of a sentence
(cf. Fig. 2 for an abstract configuration
schema in terms of dependency grammar).
Note that there are some exceptions to
this general statement, which relate to syn- 
tactic phenomena like coordination, copula 
sentences, interrogative clauses, some types 
of subordinate clauses, etc. 




Figure 2: Abstract Configuration Schema for
Topic/Comment Patterns

Not only are topic/comment patterns relevant
for the ranking on the Cf; it is also
important whether an element of the sentence
is anaphoric or not. Anaphoric elements
are generally ranked higher than any
non-anaphoric elements, irrespective of the
topic/comment structure of the sentence in
which they occur (cf. Hajičová et al. (1992)).

The rules holding for the ranking on the Cf
for German are summarized in Table 1. They
are organized at three layers. At the top
level, $>_{TC_{base}}$ denotes the basic relation for
the overall ranking of topic/comment (TC)
patterns. Accordingly, any anaphoric expression
in the preceding sentence $U_n$ is given the
highest preference as a potential antecedent
of an anaphoric (or elliptical) expression in
$U_{n+1}$; the other types of functional
configurations, viz. head rheme, right edge rheme,
head theme, or unmarked head(s), are accessible
in the given decreasing preference order.
Our studies on expository texts in Ger-



representational granularities might dif- 
fer among various subworlds (associated 
with different basic categories). 

These principles are but a bottom-line for 
a discipline that might be called epistemo- 
logical engineering. It is currently emerg- 
ing from several attempts to develop a for- 
mal methodology for knowledge acquisition 
(cf. Alexander et al. (1988)), but still re- 
lies on the provision of experiential guide- 
lines for building concrete, non-toy DKBs in 
a terminological (cf. Brachman et al. (1991, 
Section 14.5), Gates et al. (1989, Section 1) 
or Monarch and Nirenburg (1990)) or pred- 
icate calculus language framework (Hobbs, 
1984). The problems one encounters in that 
area are rooted in fundamental philosophical 
problems and are rearticulated by all those 
involved in a metatheory of knowledge rep- 
resentation in the artificial intelligence camp. 
McCarthy (1977), for example, stresses ques- 
tions of observational availability of data and 
correspondence relations between observable 
data and their proper formal representation. 

These are not just abstract philosophical 
debates, as the problems under considera- 
tion turn up in every KE enterprise. Davis 
et al. (1993, pp. 19-21) convincingly argue 
that we are caught in a plethora of ontologi- 
cal commitments which accumulate, at least, 
at three layers — the choice of a particu- 
lar representation format (logic, nets, frames, 
etc.), the force of the major representation 
constructs of this framework (e.g., prototypi- 
cality, defaults, and taxonomic hierarchies as 
far as frame representations are concerned), 
and, finally, the knowledge engineering level 
(which we specifically address by the KE 
principles stated above). At this level, hu- 
man conceptual bias results from (uncon- 
scious) selectivity of observation. But also 
the perspective one chooses to solve the rep- 
resentation problem by (consciously) focus- 
ing on the "relevant" issues adds further bias: 
Which knowledge items should be included in 
the actual representation structures? Where 
do they appear in the hierarchy (again, if 
object-centered representations are chosen)?, 
etc. 

The ellipsis resolution problem, neverthe- 
less, incorporates any of these commitment 



layers and even projects solutions worked out 
at the knowledge layer on the data struc- 
ture or symbol layer of representations. By 
this, we mean the abstract implementation of 
knowledge representation structures in terms 
of graphs and their path connectivity pat- 
terns. At this level, however, we have reasons 
to assume that the proximity metric on which 
we build makes sense. We here draw on 
early work from cognitive psychologists such 
as Quillian (1969) and Rips et al. (1973), or 
more recent research in the parsing domain 
proper, e.g., by Charniak (1986). Their ex- 
periments provide ample evidence that the 
definition of proximity in semantic networks 
in terms of the traversal of typed edges (e.g., 
only via generalization or via attribute links) 
and the corresponding counting of nodes that 
are passed on that traversal is not only com- 
mon practice but also methodologically valid 
for computing conceptual distances. 

The KE principles mentioned above are 
supplemented by the following linguistic reg- 
ularities which hold for textual ellipsis: 

1. Adherence to a Focused Context. 
Valid antecedents of elliptical expres- 
sions occur within subworld boundaries 
(in technical terms, they remain within 
a single knowledge base context, micro 
theory, etc.). Given the above KE con- 
straints (in particular the one requiring 
each subworld to be characterized by the 
same degree of conceptual density), path 
length criteria make sense for estimating 
the conceptual proximity between con- 
cepts. Moreover, only links of a certain 
type are considered for traversals, viz. 
those covered by the part-o/ relation. 

2. Limited Path Length Inference. 

Valid pairs of possible antecedents and 
elliptical expressions denote concepts in 
the DKB whose conceptual relations 
(role chains) are constructed on the ba- 
sis of rather restricted path length units 
(in our experiments, no valid chain ever 
exceeded unit length 5). This corre- 
sponds to the implicit requirement that 
these role chains must be efficiently com- 
putable. 

Remaining within a focal subworld, con- 
straining the relation types for graph traver- 
Figure 1: Fragment of the Underlying Domain Knowledge Base

Given sentence (1) and Fig. 1, according to
the convention above PCI-Motherboard is
conceptually most proximate to the elliptical
occurrence of Cpu (due to the direct conceptual
role Motherboard - has-cpu - Cpu with
unit length 1), while the relationship
between LTE-Lite-25 and Cpu exhibits
a greater conceptual distance (counting with
unit length 3, due to the triple composition
of roles Computer-System - has-central-unit -
Central-Unit - has-motherboard - Motherboard -
has-cpu - Cpu).

3 Epistemological Engineering for 
Ellipsis Resolution 

Metrical criteria incorporating path connec- 
tivity patterns in network-based knowledge 
bases (i.e., concept graphs) have often been 
criticized for lacking generality and introduc- 
ing ad hoc criteria likely to be invalidated 
when applied to different domain knowledge 
bases (DKB). The crucial point about the 
presumed unreliability of path-length crite- 
ria seems to address the apparent problem 
how the topology of such a network system 
can be "normalized" such that formal dis- 
tance measures relate to intuitively plausible 
conceptual proximity judgments. Though we 
have no formal solution for this correspon- 
dence problem, our proposal tries to elimi- 
nate structural idiosyncrasies by postulating 
two knowledge engineering (KE) principles: 



1. Clustering into Basic Categories.
The specification of the upper level of
the ontology of some domain (e.g.,
information technology (IT)) should be
based on a stable set of abstract, yet
domain-oriented ontological categories
inducing an almost complete partition
on the entities of the domain at a
comparable level of generality (e.g.,
hardware, software, companies in the IT
world). Each specification of such a
basic category and its taxonomic descendents
constitutes the common ground
for what Hayes (1985) calls clusters and
Guha and Lenat (1990) refer to as micro
theories, i.e., self-contained descriptions
of conceptually related proposition
sets about a reasonable portion of
the commonsense world within a single
knowledge base partition (context,
subtheory).

2. Balanced Deepening.
The specification of the lower levels of that
ontology dealing with concrete objects of
the domain (e.g., notebooks, laser print- 
ers, hard disks in the IT world) should 
be carefully balanced, i.e., the extrac- 
tion of attributes for any particular cate- 
gory should proceed at a uniform degree 
of detail at each decomposition level. 
This requirement should guarantee that 
any subworld has the same level of rep- 
resentational granularity, although the 



while Carberry (1989) elaborates on multi- 
ple pragmatic criteria involving beliefs, plans 
and goals). While QA ellipsis, when isolated 
from its discourse setting, often tends to be 
ungrammatical or at least fragmentary at the 
surface level, textual ellipsis is characterized 
by entirely grammatical sentences which only 
lack explicit reference to the discourse enti- 
ties already available from the context. 

Surprisingly little effort has so far been
spent on the design of computational models
for the analysis of textual ellipsis (cf. Section
7), although these constructions occur at
significant rates in expository texts. Fraurud
(1990) carried out an experimental study on
the distribution of (in)definite NPs in such
texts and found that 36% of their occurrences
can be classified as cases of ellipsis,
while another 36% belong to the category of
anaphora.

2 Constraints on Textual Ellipsis 

Textual ellipsis - at the conceptual level - re- 
lates an elliptical expression to its antecedent 
by conceptual attributes (or roles) associ- 
ated with that antecedent (see (1) below). It 
thus complements the phenomenon of nomi- 
nal anaphora, which are related to their an- 
tecedent in terms of conceptual generaliza- 
tion (cf. Strube and Hahn (1995)). 

(1) Compaq bestückt den LTE-Lite/25 mit einem
PCI-Motherboard. Die CPU hat eine Taktfrequenz
von 50 MHz.

(Compaq equips the LTE-Lite/25 with a
PCI-motherboard. The CPU comes with a clock
frequency of 50 MHz.)

We call this phenomenon textual ellipsis 
because in the second sentence a conceptual 
entity that relates the topic of this sentence 
to discourse elements mentioned in the pre- 
ceding one is missing. Hence, the appropri- 
ate conceptual link must be inferred in or- 
der to establish the cohesion of the whole 
discourse (for an early statement of that 
idea, cf. Clark (1975)). In (1) the informa- 
tion is missing that the CPU is part of the 
(P CI- ) mother-hoard. This apparent relation 
can only be inferred if conceptual knowledge 
about the domain is available. It is obvious 
(see Fig. 1)^ that the concept Cpu is bound 

^The following notational conventions apply to the 
knowledge base for the information technology do- 



in a direct part-of-tjpe relation, viz. has- 
cpu, to the concept Motherboard, while 
its partonomic relation to the instance Lte- 
LlTE-25 is not so tight; a conceptual rela- 
tionship between Cpu and Compaq is ex- 
cluded at the conceptual level, since they are 
not linked via any part-of-tjpe conceptual 
role. 

Nevertheless, part-of-tjpe conceptual roles 
are far too unconstrained to properly dis- 
criminate among several possible antecedents 
in the preceding discourse context. We there- 
fore propose a basic heuristic for concep- 
tual proximity which takes the path length 
between concept pairs into account. It is 
based on the common distinction between 
concepts and relations/roles in classification- 
based terminological reasoning systems (cf. 
MacGregor (1991) for a survey). Concep- 
tual proximity takes only conceptual roles 
into consideration, while it does not con- 
sider the generalization hierarchy between 
concepts. The heuristic can be stated as 
follows: If connecting role paths exist 
between the concepts denoted by possible 
antecedents and an elliptical expression 
via one or more conceptual relations (roles), 
the role composition whose path contains the 
fewest roles is preferred for ellipsis 
resolution. If several connected role paths 
of equal length exist, then functional 
constraints based on topic/comment patterns 
apply for the selection of the proper 
antecedent. Only at this level is grammatical 
information from the preceding sentence 
brought into play (for a more precise 
statement in terms of the underlying grammar, 
cf. Table 5 in Section 5). 
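
The path-length heuristic can be illustrated with a minimal sketch, 
which is not the parser's actual Smalltalk/LOOM implementation. It 
assumes a toy role graph in which only has-cpu and has-central-unit 
are taken from the text; the has-motherboard role, the flattening of 
inherited roles onto instances, and the function names role_path and 
closest_antecedents are hypothetical. A breadth-first search follows 
conceptual roles only (no isa links) and returns the candidates whose 
role path to the elliptical expression is shortest. 

```python
from collections import deque

# Toy role graph: node -> list of (role name, role filler).
# Assumption: inherited roles are already flattened onto each node,
# and has-motherboard is a hypothetical role added for illustration.
ROLES = {
    "Lte-Lite-25": [("has-central-unit", "Central-Unit")],
    "Central-Unit": [("has-motherboard", "Pci-Motherboard")],
    "Pci-Motherboard": [("has-cpu", "Cpu")],
    "Compaq": [],
}


def role_path(source, target, roles=ROLES):
    """Return the shortest list of role names leading from source to
    target, or None if no role path connects them."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for role, filler in roles.get(node, []):
            if filler not in seen:
                seen.add(filler)
                queue.append((filler, path + [role]))
    return None


def closest_antecedents(candidates, ellipsis, roles=ROLES):
    """Return (candidate, path) pairs with the fewest roles on the path.
    More than one pair means a tie, which the functional
    (topic/comment) constraints would have to resolve."""
    linked = []
    for candidate in candidates:
        path = role_path(candidate, ellipsis, roles)
        if path is not None:
            linked.append((candidate, path))
    if not linked:
        return []
    best = min(len(path) for _, path in linked)
    return [(c, p) for c, p in linked if len(p) == best]


candidates = ["Pci-Motherboard", "Compaq", "Lte-Lite-25"]
print(closest_antecedents(candidates, "Cpu"))
```

In this toy graph Compaq has no role path to Cpu and is excluded, 
Lte-Lite-25 is linked only through a longer chain, and Pci-Motherboard 
is preferred because it is linked at path length 1. 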

main (see Fig. 1): Angular boxes from which double 
arrows emanate contain instances (e.g., Lte-Lite- 
25), while rounded boxes contain generic concepts 
(e.g., Notebook). Directed unlabelled links relate 
concepts via the isa relation (e.g., Notebook and 
Computer-System), while links labelled with an 
encircled square represent conceptual roles (in Fig. 
1 only definitional roles occur) whose names and 
value constraint expressions are attached to each 
circle (e.g., Computer-System - has-central-unit - 
Central-Unit, with small italics emphasizing the 
role name). Note in particular that any subconcept 
or instance inherits the conceptual attributes from its 
superconcept or concept class (this is not explicitly 
shown in Fig. 1). 
Abstract 

A hybrid methodology for the resolution of 
text-level ellipsis is presented in this paper. 
It incorporates conceptual proximity criteria 
applied to ontologically well-engineered do- 
main knowledge bases and an approach to 
centering based on functional topic/comment 
patterns. We state text grammatical predi- 
cates for ellipsis and then turn to the proce- 
dural aspects of their evaluation within the 
framework of an actor-based implementation 
of a lexically distributed parser. 
Keywords: text understanding, text pars- 
ing, text ellipsis, conceptual distance metric, 
topic/comment, centering approach 

1 Introduction 

The work reported in this paper is part of 
a large-scale text understanding system for 
knowledge acquisition from German exposi- 
tory texts. Text phenomena are a particu- 
larly challenging issue for the design of ap- 
propriate parsers, since a lack of recognition 
facilities results in either referentially 
incohesive or, even worse, invalid text 
knowledge representations. In a previous paper 
(Strube and Hahn, 1995), we have already 
dealt with text-level anaphora (e.g., "Jack 
owns a car. It cost him $35,000."), the res- 
olution of which contributes to the con- 
struction of (referentially) valid text knowl- 
edge bases. In this paper we propose a 
methodology for the resolution of text-level 
ellipsis yielding (referentially) cohesive text 
knowledge bases. The phenomena we ad- 
dress (also called functional anaphora) can 
be illustrated by the following sentence pair: 
"Jack owns a car. The tires [of the car] need 
to be changed." ("[...]" indicates material 
deleted from the surface expression). The 
approach to text ellipsis resolution we pro- 
pose integrates language-independent con- 
ceptual (distance measure) and language- 
dependent functional (topic/comment) con- 
straints based on the centering approach 
(Grosz et al., 1995). 

We explicitly exclude two terminologically 
related problems from our study. First, we 
restrict the consideration of ellipses to their 
textual form, i.e., one that extends over for- 
mal sentence boundaries. This excludes, in 
particular, any constructions which build on 
coordination and corresponding elision phe- 
nomena within the sentence (e.g., "Jack owns 
a car, [and Jack owns] a house, and [Jack 
owns] a record shop."). We also exclude cases 
where several lexical "traces" signal ellipti- 
cal expressions, e.g., phenomena underlying 
VP ellipsis (e.g., "Jack owns a car, and so 
does John [own a car]."), an issue of partic- 
ular relevance for the English language but 
almost irrelevant for others such as German. 
These forms of ellipses are usually explained 
in terms of structural, i.e., syntactic phe- 
nomena, viz. the application of proper dele- 
tion, recoverage or copying rules for "par- 
allel" constructions (cf., e.g., Hardt (1992a; 
1992b), Kehler (1993), Grover et al. (1994)). 

Second, ellipses in written texts must 
clearly be distinguished from elliptical 
constructions as they occur in question- 
answering dialogues (e.g., Q: "What is your 
hobby?" A: "Playing jazz music [is my 
hobby]."). For a treatment of that issue, cf., 
e.g., Weischedel and Sondheimer (1982) and 
Carbonell (1983), who emphasize syntactic 
and semantic strategies for ellipsis resolution. 



